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Abstract 

Complex network theory is used to investigate the structure of meaningful con- 
cepts in written texts of individual authors. Networks have been constructed after 
a two phase filtering, where words with less meaning contents are eliminated, and 
all remaining words are set to their canonical form, without any number, gender 
or time flexion. Each sentence in the text is added to the network as a clique. A 
large number of written texts have been scrutinized, and its found that texts have 
small-world as well as scale-free structures. The growth process of these networks 
has also been investigated, and a universal evolution of network quantifiers have 
been found among the set of texts written by distinct authors. Further analyzes, 
based on shuffling procedures taken either on the texts or on the constructed net- 
works, provide hints on the role played by the word frequency and sentence length 
distributions to the network structure. Since the meaningful words are related to 
concepts in the author's mind, results for text networks may uncover patterns in 
communication and language processes that occur in the mind. 



1 Introduction 



Concepts of complex networks have proven to be powerful tools in the analysis 
of complex systems [1,2,3,4,5]. They have been applied to modelling purposes 
as well as to search for properties that naturally emerge in actual systems 
due to their large-scale structure. Unlike random graphs, complex networks 
reveals ordering principles related to their topological structure. This way, if 
complex systems are mapped onto networks, it is possible to use their con- 
ceptual framework to identify and even to explain features that seem to have 
universal character. 
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Several complex networks have been proposed in the scientific literature as- 
sociated with real systems: the biological food wcb[6], technological commu- 
nication networks as the Internet, information networks as the World Wide 
Web[7], social networks defined by friendship relations among individuals, etc. 
[8]. 

Word networks have been used to address complex aspects of human language. 
In such studies, words are connected according either to semantic associations 
[11,12,13] or even by nearness in the text [14,15,16], i.e., based on what is 
commonly called word window with a fixed number of words. Those works 
intend to establish the structure of a given language as a whole. Because of 
this, they deal with a huge amount of texts, independently of their authors, 
what is called corpora. 

Language is obviously a product of brain activity and, to this regard, we would 
like to call the attention to works revealing the presence of neuronal networks 
in the brain by direct physiological measurements. In a recent work[17], func- 
tional magnetic resonance of human brains has established networks whose 
vertices are regions of the cerebral cortex that are activated by external stim- 
uli according to a temporal correlation of activity. 

In this work, we investigate the relations between the concepts in individual 
written texts, by using them as starting point to construct significant networks. 
Projecting both the concepts present in the text, as well as the way they are 
related among them, onto a network gives the opportunity to use the tools 
and concepts developed within the network framework to characterize, in a 
quantitative way, how the concepts in a written text appear, how ordered and 
connected they are, how close to each other they are within the text, and so 
on. 



2 Network concepts and text network construction 

An undirected network is defined by a set of elements, called vertices or nodes, 
represented graphically by points, some of which are joined together by an 
edge, represented by a line between these points. The topological structure of 
many networks displays a large-scale organization, with some typical proper- 
ties like: highly sparse connectivity, small minimal paths between vertices and 
high local clustering. 

To analyze a network, a number of indices (or quantifiers) have been proposed, 
which allow for a quantification of many of the properties quoted above. In 
this work we will use the most important of them: number of vertices N, 
number of edges M, average connectivity < k >, average minimal path length 
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diameter L, degree distribution p{k) , and the Watts and Strogatz clustering 
coefficient C (for a thorough discussion of these indices see, e.g., [5]). If the 
node degree distributions follow power laws, p{k) ~ k~'^, the exponent 7 
becomes an important index for the network characterization. 

Two important network scenarios, characterized respectively by the small- 
world [9] and scale-free structures [10], have been identified to be present in 
several complex networks, including those related to the subject of this work, 
namely the analysis of corpora and of brain activity. The first one is related 
to the concomitant presence of small average value for i and high C, while the 
second is related to a power law distribution of node degree. 

To map the texts onto networks that reflect the structure of meaning concepts 
within it, we have preserved only the words with an intrinsic meaning, elim- 
inating words which have a merely grammatical function, as to arrange the 
syntactical structure of sentences in the text (articles, pronouns, prepositions, 
conjunctions, abbreviations, and interjections). Afterwards, we have reduced 
the remaining words to their canonical forms, i.e., we have disregarded all in- 
flections as plural forms, gender variations, and verb forms. Such procedures 
are common in studies on language[18]. In order to perform a computer imple- 
mentation, we have used some routines, dictionaries, and grammatical rules 
from UNITEX package [19]. Unknown words to the program were preserved 
in their original forms. After this filtering treatment, we have constructed a 
network for each individual text, where each distinct word corresponds to a 
single vertex. Since we consider the sentence as the smaller significant unit of 
the discourse, two words are connected if they belong to the same sentence. 
Thus, sentences are incorporated into the network as a complete subgraph, 
that is, a clique of mutually connected words. Sentences with common words 
are connected by the shared words. The method we propose offers a natural 
way to analyze the growth process of the network evolution. We have investi- 
gated both the network behavior in this step by step evolving process, what 
we call dynamical analysis, as well as the behavior of the entire network in its 
final form, corresponding to the entire text, called the static analysis. Both 
methods reveal important properties and they shall be discussed further on. 
In Figure 1, we show an example of a text filter process together with the 
evolving process of network construction for a well known jingle. 

In order to guarantee an uniform and comprehensive sample, we have chosen 
a collection of 312 texts [20], which can be cast into different classes as: genre 
(59% technical, 41% literary), language (53% in Portuguese, 47% in English), 
gender (72% male authors, 28% female authors). Finally wc have also classified 
the texts according to their size (55% with less than 1,000 sentences, 45% with 
more than 1,000). The smaller text has 169 and the larger, 276,425 words; in 
the average, the texts have 32,691 words. 
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Fig. 1. The filtering treatment and the growth process of the network for a small 
poem. 




Fig. 2. Degree distributions for a very large and the smallest text. 
3 Results 

The texts were individually analyzed, based on the evaluation of the indices 
quoted in the previous section: A^, M, </c>,^, C, L, and degree distribution 
which we have found to obey a power law ~ . We have also 
followed how the number and size of independent clusters varies in the growth 
process. Average values of such indices based on the entire texts of the whole 
sample have been computed and are shown in Table I. The analyzed networks 
are very sparse, with average connection index 2M/[N{N — 1)] < 0.1 for the 
networks with N > 100 . 

We have verified that the degree distributions, for all analyzed texts, exhibit an 
early maximum around /c = 10, which is followed by crossover to a long right- 
skewed tail, approximating a power law distribution with average exponent 
(7) = 1.6 ± 0.2. We have found no evidence of correlation between the value 
of 7 and the length of the analyzed text. These features are illustrated in 
Figure 2, where we show p{k) x k for a very large text and the smallest one 
we analyzed. 

The dynamical analysis also reveals very interesting results. In Figure 3(a), 
points indicate the evolution of C for 15 large texts, as function of the network 
sizes, showing that they converge to a value around 0.75. We have identified 
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the data of two specific texts (in color) to emphasize such phenomenon. This 
pecuhar behavior of C is also exhibited when we plot its value as function of 
the number of sentences s in the different texts, as shown in Figure 3(b). A 
functional dependence between C and s can be written as C ~ 1 — O.OSiog'ioS. 

The investigation of the cluster structure of the network during its growth has 
shown that texts evolve mostly in the form of a very large single cluster, which 
is formed since the very beginning of the network evolution. New words are 
likely to adhere to it rather than starting other significant clusters, what leads 
to the presence of a single giant cluster for the entire text. Another aspect 
we have analyzed was the evolution of the indices in the network growth 
process with respect to the same text in different languages. As an example 
we compare, in Figure 4, the evolution of C and I for the network associated 
to James Joyce's Ulysses, in the original Enghsh version and in its translation 
to Portuguese. 

A partial discussion of the results we have obtained to this point indicates that 
the networks we analyzed have highly sparse connectivity, small D and ^, but 
high C, what constitute evidences of a small-word network scenario. Besides, 
since p{k) decay according to power laws, we conclude they also behave as 
scale- free networks (see values in Table I). It is important to emphasize that 
the filtering treatment does not modify the network general behavior. In order 
to verify this fact, we have also performed the same measurements for the 
networks obtained from the original texts, without any kind of treatment. 
We have verified that, notwithstanding to the fact that we obtain different 
values for the same indices (see Table I), the very frequent presence of articles, 
prepositions and other words does not alter the small-world and scale-free 
character of the text networks! The small-world behavior is characterized by 
a great compactness and by the presence of some vertices with high local 
clustering. Such aspect in these networks means that they have words that are 
often used along the text. That fact explains the small values of D and /. The 
scale-free behavior results from the presence of nodes with high connectivity 
degree in a much greater amount in comparison to that of a random graph. 
This aspect is obviously expected in a written text, due to the presence of an 
organization principle which controls the construction of the text. The most 
connected words are associated with recurrent concepts around which the text 
is constructed. 



4 Shuffling procedures 

There are several agents that contribute to determine the measured indices 
in the analyzed networks. For example, since each sentence is added to the 
network as a complete subgraph, their lengths may interfere in the cluster- 
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Fig. 3. (a) Behavior of C as function of the number of vertices compared with two 
texts, (b) The same index as function of the number of sentences, for 15 texts. 
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Fig. 4. Clustering coefficient and average shortest path length evolution, according 
to the number of sentences in James Joyce's Ulysses, in the original English version 
and in a Portuguese translation. 

ing coefficient. In the same way, the own structure of the sentences and the 
frequency of the words along the whole text affects C and other indices as 
well. In order to evaluate the role of these main agents in the determination 
of the indices, we have re-analyzed the filtered texts after submitting them 
to four shuffling processes, which arc so characterized: {SA) In this process, 
the beginning and the end of sentences are kept fixed, however the position of 
the words along the whole text is randomly changed. This procedure breaks 
completely the structure of concepts in the sentences, but their lengths and 
the frequency of the words in the whole text are kept unchanged. {SB) Here, 
the original sequence of the words along the text remains unchanged, but all 
sentences are forced to have the same average length, as obtained from the 
analysis of the unperturbed text. {SC) As in SA, the beginning and the end 
of sentences are kept fixed, and words are randomly chosen from the same 
vocabulary of the text. However, the words frequency is changed to a white 
distribution, so that all the words have the same choice probability. {SD) 
Here, we randomly distribute the same number of links as in the network of 
the filtered text among the nodes, i.e., obtaining an Erdos-Renyi network. 

A summary of results is also included in the Tabic I. They show that the 
networks are affected in different ways but, as expected, for all but the SD 
procedure, they are very far from random networks. The indices for SA and SB 
are not significantly altered. In the first case it shows that, without changing 
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Fig. 5. Cumulative degree distributions P{k) obtained with one single text, for the 
filtered version (solid), and the shuffle procedures SA (dashed), SB (dotted), and 
SC (dash-dotted). 

the structural aspects of sentence sizes and word frequencies, the network is 
not essentially affected. In the second one, breaking sentence sizes but keeping 
words in their original places is not sufficient to alter, in a meaningful way, 
the network structure. These traits are corroborated by Figure 5, where we 
show the 4 cumulative degree distribution 



for a given text and three shuffled versions of it. We realize that the scale-free 
character, expressed by a linear decay in the k interval [10,300] for P{k) re- 
mains almost unchanged. These results remind Zipf's universal result about 
distribution frequency of words in texts since, when Zipf's law is not followed, 
e.g., SC and SD, the network structure is broken. Indeed, not only the power 
law for P{k) is lost, but also the indices in Table I have been altered. Nev- 
ertheless we emphasize the fact that, despite the change in word frequency 
caused by SC sets up a deep modification in the text, its value for C is much 
higher than that one for SD. This indicates that the unchanged distribution 
of sequence sizes still keeps the SC network far away from the Erdos-Renyi 
scenario. 

The results of the dynamical analysis call our attention to general aspects of 
network growth. Barabasi and Albert [10] observed that the scale- free behavior 
may be explained by a special kind of growth process, known as preferential 
attachment. They proposed a model that captures such behavior: vertices are 
added to the network, systematically, by connecting them to some just exist- 
ing vertices, which are selected in accordance with a probability distribution 
that depends on their degrees. However, the values obtained for C arc much 
lower than those in small-world networks. On the other hand, the Watts and 
Strogatz small-world, obtained by randomly rewiring a regular network, does 
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Table 1 

Average values for the indices. Filtered texts were used as standard subjects, and 
shuffling operations were carried out on networks generated by them. Original means 
text without filtering, used just for the purpose of comparison. 

not reproduce the scale-free feature[9] Therefore, we may suggest that the 
growth process we employ here is essential to capture both quoted behaviors. 
It is characterized by the addition of a new sentence, i.e., a complete subgraph 
or clique in each step. Due to the frequency distribution of w^ords along the 
text, the attachment of these complete subgraphs is still preferential, but the 
new vertices are highly clustered, what contributes to the coexistence of both 
small-world and scale-free network scenario. Thus, the model of preferential 
attachment by complete subgraphs according to a pre-established frequency 
of vertices seems to be more suitable to describe the concomitant emergence 
of these behaviors. 

To conclude, we would like to stress that our major purpose, to analyze net- 
work structure among concepts expressed by significant words in written texts, 
has been successfully addressed. The obtained indices point to the presence 
of both small-world and scale-free features. Moreover, we also characterized 
important aspects observed as the text grows, and studied the influence of 
distinct shuffling procedures of words and sentence size distribution. As ex- 
pressed in the first paragraphs, our results may be of significance to shed light 
into the deeper problem of how the human mind works. We have briefly men- 
tioned some efforts concerning the construction and analyzes of networks that 
represent the relations between neurons in brain cortex[17], such networks 
presenting both small-world and scale-free behavior. 

The relation mind-brain is a subject of great controversy, but we wonder 
whether the universal character of the network behavior we have found in 
our work does not point out to some patterns present in the human mind. 
Indeed, due to the intense intellectual activity required to produce texts, it is 
natural to expect that the results may help understanding the way our mind 
works. 

Patterns that might control the way we store concepts in our minds and could 
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reveal aspects of the structure of our brains. It is also worthy to remember that 
similar patterns to those we obtained here have been observed in networks of 
semantically related words. Thus, it is natural to conjecture that the results 
we found do reveal the existence of links between all these research areas. 
Certainly, there is much to be explored in such text networks, as this is a 
vast field and the possibihty of a joint effort from several different areas of 
investigation is open. 
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